1 Class Topics

The class has covered the following topics so far, and all of them will be pursued in this project for modeling and detailed analysis.

  1. Linear Regression
  2. KNN
  3. ANN
  4. SVM
  5. Decision Trees
  6. Random Forest

2 Motivation

Citi Bike is a bicycle-sharing service in New York City operated by Motivate, an organization that manages some of the world’s largest bike networks in large cities. Because bikes must be picked up and dropped off at Citi Bike docking stations, people can easily take one-way trips and drop their bike off at any one of 900 stations.

Citi Bikes are available 24 hours a day, 7 days a week, 365 days a year. Given that availability, NYC’s enormous population, and its variable weather, we believe there are insights to be gleaned from historical rider and local weather data. There are also opportunities to improve safety by sending notifications to riders’ phones about riding conditions and availability.

This report will use Citi Bike rider and NYC weather data to predict daily customer behavior and describe the relative impact of different weather events on bike usage.

3 Read Me

This document presents the group project, which uses Citi Bike data from NYC. Its purpose is to share an analysis of the influence of weather on bike usage.

Citi Bike data from 2019 is used for the analysis. Specifically, a 5% sample of the full 2019 data was extracted and aggregated into average ride usage per day (365 days in total). The daily usage data was then randomly divided into train (80%, 292 days) and test (20%, 73 days) sets for modeling, testing the accuracy of the models, and business analytics.
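
The day-level split described above can be sketched in base R as follows. This is a minimal illustration rather than the report's actual code: the seed value and the variable names are assumptions, so the specific days drawn will differ from the ones used in the report.

```r
# Sketch of an 80/20 split over the 365 days of 2019.
# The seed is an arbitrary placeholder, not the one used in the report.
set.seed(42)

day_count  = 1:365                               # one entry per calendar day
n_train    = round(0.8 * length(day_count))      # 292 days
train_days = sort(sample(day_count, n_train))    # random 80% of the days
test_days  = setdiff(day_count, train_days)      # remaining 73 days

length(train_days)  # 292
length(test_days)   # 73
```

Splitting by day rather than by individual trip keeps all trips from one calendar day on the same side of the split.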

Ultimately, the findings are used to identify the relationship between weather and bike usage and to predict bike usage from chosen parameters (e.g. precipitation amount). Some insights for the bike business are then presented.

Note that there was an effort to add station capacity information to the dataset. It was not pursued because capacity could be identified for only 20% of the stations in the dataset.

4 Dataset Explanation

  • Citi Bike data in NYC
  • Each raw dataset comprises one month of trip data
    • Data from the 12 months of 2019 have been combined and divided into train (80%) and test (20%) datasets.
  • Variables (Citi Bike):
    • trip duration (sec)
    • start time (year, month, day, time)
    • stop time (year, month, day, time)
    • start station (id, lat/long)
    • end station (id, lat/long)
    • bike id
    • user type
      • customer (24 hr vs 3 day)
      • subscriber (annual)
    • birth year
    • gender (0 = unknown, 1 = male, 2 = female)
    • trip distance (calculated from the start and end lat/long with the R package haversine)
    • trip speed (calculated from trip distance and trip duration)
    • total # bike rentals (per day)
  • Variables (Weather):
    • TMIN (minimum temperature in C)
    • TMAX (maximum temperature in C)
    • Average wind speed
    • PRCP (mm)
    • SNOW (mm)
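
The derived trip distance and trip speed variables listed above can be illustrated with a hand-written haversine function. This is a sketch under stated assumptions: it is not the code of the package named in the list, the function name `haversine_m`, the Earth radius, and the example coordinates are all chosen here for illustration, and the result is a straight-line distance between stations rather than the route actually ridden.

```r
# Great-circle distance in metres between two lat/long points (degrees),
# using the haversine formula with a mean Earth radius of 6,371 km.
haversine_m = function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad = pi / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical trip: two made-up points in Manhattan and a 600-second ride
trip_distance = haversine_m(40.7484, -73.9857, 40.7580, -73.9855)
trip_duration = 600
trip_speed    = trip_distance / trip_duration   # metres per second
```

Because the function is vectorized, it can be applied to whole columns of start and end coordinates at once.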

5 Data Collection And Preparation

NOTE: Because the raw dataset is extremely large, a separate RMD is used to process the raw data and generate the dataset used for all analysis in this RMD report.

5.1 Citi Bike Data

The raw 2019 Citi Bike usage dataset was obtained from “https://s3.amazonaws.com/tripdata/index.html”. All 12 months of 2019 data were combined first. Because the combined data is extremely large (over 22 million rows), a 5% random sample of the raw dataset was drawn, with set.seed() making the draw reproducible, resulting in over 1 million rows of data. A few data cleaning steps were then completed:

  1. Extract year, month, day, hour and minute from the “Date” column, which holds all time information in one cell and is hard to use directly in further analysis.
  2. Birth years before 1920 are replaced with the median birth year of the whole dataset. The team assumed that people over 100 years old would not be riding bikes and that these entries were jokes by the people who filled out the survey.
  3. Rows with trip duration above the 99.9th percentile are removed. Some outliers were noticed, such as a trip of 3 million seconds. These outliers would skew the analysis and modeling, so the team decided to remove them.
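
The sampling and the three cleaning steps above might look roughly like the following in base R. The data frame is a small synthetic stand-in, and the column names (`Date`, `birth_year`, `tripduration`) and the seed are assumptions made for this sketch, not the names used in the separate processing RMD.

```r
set.seed(2019)  # placeholder seed; the report's actual seed is not stated

# Synthetic stand-in for the combined 2019 trips (column names are assumptions)
trips = data.frame(
  Date = as.POSIXct("2019-01-01 00:00:00", tz = "UTC") +
    seq(0, by = 3600, length.out = 2000),
  birth_year   = sample(c(1885, 1960:2003), 2000, replace = TRUE),
  tripduration = c(round(runif(1998, 61, 3600)), 3000000, 2500000)
)

# 5% random sample of the rows
sampled = trips[sample(nrow(trips), size = round(0.05 * nrow(trips))), ]

# Step 1: split the combined date-time into separate columns
lt = as.POSIXlt(sampled$Date, tz = "UTC")
sampled$year   = lt$year + 1900
sampled$month  = lt$mon + 1
sampled$day    = lt$mday
sampled$hour   = lt$hour
sampled$minute = lt$min

# Step 2: replace implausible birth years with the dataset median
median_birth_year = median(sampled$birth_year)
sampled$birth_year[sampled$birth_year < 1920] = median_birth_year

# Step 3: drop trips longer than the 99.9th percentile of trip duration
cutoff  = quantile(sampled$tripduration, 0.999)
cleaned = sampled[sampled$tripduration <= cutoff, ]
```

Fixing the seed before `sample()` is what makes the 5% draw repeatable from run to run.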

5.2 Weather Data

The raw NYC weather data was obtained from the NOAA website through a Python API. The year, month, day, hour and minute are extracted from the “Date” column.

5.3 Data Merging

The bike data and weather data were merged using year, month and day as the relational keys. The final combined dataset was then separated into train (80%) and test (20%) datasets and stored in the files “train_final.csv” and “test_final.csv”. Only later in the process did the team realize that the train/test separation should not have been done at such an early stage. However, repeating every step above would have been very time consuming, even for the combined 5% dataset. The team therefore decided to recombine “train_final.csv” and “test_final.csv” at the beginning of further analysis, which is much less time consuming.
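
A minimal sketch of the merge and of the later recombination, assuming both tables carry `year`, `month` and `day` columns. The small data frames are illustrative only: the first two weather rows reuse values visible in the report's output, while the February row and all trip durations are invented.

```r
# Toy trip and weather tables keyed by calendar day
bike = data.frame(
  year = 2019, month = c(1, 1, 1, 2), day = c(1, 1, 2, 1),
  tripduration = c(917, 640, 729, 812)
)
weather = data.frame(
  year = 2019, month = c(1, 1, 2), day = c(1, 2, 1),
  TMIN = c(3.9, 1.7, -5.0), TMAX = c(14.4, 4.4, 1.0), PRCP = c(1.5, 0.0, 0.0)
)

# Join each trip to the weather of its calendar day
combined = merge(bike, weather, by = c("year", "month", "day"))

# Undoing an early train/test split is a row bind of the two parts, e.g.
# rbind(read.csv("train_final.csv"), read.csv("test_final.csv"))
train_part = combined[1:3, ]
test_part  = combined[4, , drop = FALSE]
recombined = rbind(train_part, test_part)
```

Every trip on the same day receives the same weather values, which is why the weather columns repeat within a day in the output further down.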

7 Clean up Datasets

Check that no additional cleanup is needed after the train and test datasets are extracted from the full 2019 usage data.

##  [1] "start_month"       "start_day"         "start_station_id" 
##  [4] "usertype"          "day"               "age_group"        
##  [7] "time"              "day_count"         "avg_wind_speed"   
## [10] "TMIN"              "TMAX"              "PRCP"             
## [13] "SNOW"              "avg_trip_duration" "avg_speed"        
## [16] "frequency"
## 'data.frame':    551829 obs. of  16 variables:
##  $ start_month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ start_day        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ start_station_id : int  749 792 89 397 199 200 778 784 206 88 ...
##  $ usertype         : Factor w/ 2 levels "Customer","Subscriber": 2 2 2 2 2 2 2 2 2 2 ...
##  $ day              : Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ age_group        : Factor w/ 4 levels "Adult","Elderly",..: 2 1 3 1 3 3 1 1 1 3 ...
##  $ time             : Factor w/ 4 levels "afternoon","evening",..: 1 3 2 2 1 3 1 1 2 2 ...
##  $ day_count        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_wind_speed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TMIN             : num  3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
##  $ TMAX             : num  14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 ...
##  $ PRCP             : num  1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 ...
##  $ SNOW             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_trip_duration: num  1378 820 1287 950 192 ...
##  $ avg_speed        : num  2.6 1.91 2.63 1.84 2.47 ...
##  $ frequency        : int  1 1 1 2 1 3 3 2 2 1 ...
##   start_month       start_day    start_station_id       usertype     
##  Min.   : 1.000   Min.   : 1.0   Min.   :  1.0    Customer  : 94173  
##  1st Qu.: 4.000   1st Qu.: 8.0   1st Qu.:186.0    Subscriber:457656  
##  Median : 7.000   Median :15.0   Median :411.0                       
##  Mean   : 6.646   Mean   :15.2   Mean   :428.2                       
##  3rd Qu.: 9.000   3rd Qu.:22.0   3rd Qu.:710.0                       
##  Max.   :12.000   Max.   :31.0   Max.   :828.0                       
##                                                                      
##         day             age_group             time          day_count    
##  Friday   :86283   Adult     :334701   afternoon:139095   Min.   :  1.0  
##  Monday   :77317   Elderly   : 17920   evening  :164354   1st Qu.:109.0  
##  Saturday :86853   Middle Age:190794   morning  :163446   Median :190.0  
##  Sunday   :71198   Teenager  :  8414   night    : 84934   Mean   :186.1  
##  Thursday :77026                                          3rd Qu.:258.0  
##  Tuesday  :83447                                          Max.   :364.0  
##  Wednesday:69705                                                         
##  avg_wind_speed       TMIN             TMAX            PRCP       
##  Min.   :0.000   Min.   :-16.60   Min.   :-9.90   Min.   : 0.000  
##  1st Qu.:1.200   1st Qu.:  3.90   1st Qu.:11.70   1st Qu.: 0.000  
##  Median :1.600   Median : 12.80   Median :21.70   Median : 0.000  
##  Mean   :1.712   Mean   : 11.35   Mean   :19.35   Mean   : 2.744  
##  3rd Qu.:2.300   3rd Qu.: 18.90   3rd Qu.:27.20   3rd Qu.: 1.500  
##  Max.   :5.700   Max.   : 26.70   Max.   :35.00   Max.   :46.500  
##                                                                   
##       SNOW          avg_trip_duration   avg_speed       frequency     
##  Min.   : 0.00000   Min.   :   61.0   Min.   :0.000   Min.   : 1.000  
##  1st Qu.: 0.00000   1st Qu.:  392.0   1st Qu.:1.935   1st Qu.: 1.000  
##  Median : 0.00000   Median :  647.3   Median :2.481   Median : 1.000  
##  Mean   : 0.07141   Mean   :  848.1   Mean   :2.408   Mean   : 1.451  
##  3rd Qu.: 0.00000   3rd Qu.: 1080.0   3rd Qu.:2.967   3rd Qu.: 2.000  
##  Max.   :10.20000   Max.   :17859.0   Max.   :8.432   Max.   :23.000  
## 
##  [1] "start_month"       "start_day"         "start_station_id" 
##  [4] "usertype"          "day"               "age_group"        
##  [7] "time"              "day_count"         "avg_wind_speed"   
## [10] "TMIN"              "TMAX"              "PRCP"             
## [13] "SNOW"              "avg_trip_duration" "avg_speed"        
## [16] "frequency"
## 'data.frame':    148349 obs. of  16 variables:
##  $ start_month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ start_day        : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ start_station_id : int  813 530 638 821 318 65 86 345 128 788 ...
##  $ usertype         : Factor w/ 2 levels "Customer","Subscriber": 2 2 1 2 2 2 2 2 2 2 ...
##  $ day              : Factor w/ 7 levels "Friday","Monday",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ age_group        : Factor w/ 4 levels "Adult","Elderly",..: 1 1 3 1 1 1 1 1 1 1 ...
##  $ time             : Factor w/ 4 levels "afternoon","evening",..: 1 2 1 3 3 1 1 2 3 2 ...
##  $ day_count        : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ avg_wind_speed   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TMIN             : num  -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
##  $ TMAX             : num  9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 ...
##  $ PRCP             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SNOW             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_trip_duration: num  516 685 720 532 2042 ...
##  $ avg_speed        : num  2 2.69 1.66 3.36 1.81 ...
##  $ frequency        : int  2 3 1 1 2 2 2 2 1 2 ...
##   start_month       start_day     start_station_id       usertype     
##  Min.   : 1.000   Min.   : 2.00   Min.   :  1.0    Customer  : 27591  
##  1st Qu.: 5.000   1st Qu.: 9.00   1st Qu.:184.0    Subscriber:120758  
##  Median : 7.000   Median :18.00   Median :409.0                       
##  Mean   : 7.275   Mean   :17.42   Mean   :426.9                       
##  3rd Qu.:10.000   3rd Qu.:26.00   3rd Qu.:708.0                       
##  Max.   :12.000   Max.   :31.00   Max.   :827.0                       
##                                                                       
##         day             age_group            time         day_count    
##  Friday   :18062   Adult     :90298   afternoon:37667   Min.   :  6.0  
##  Monday   :19895   Elderly   : 4668   evening  :44357   1st Qu.:147.0  
##  Saturday :17199   Middle Age:50931   morning  :43497   Median :205.0  
##  Sunday   :19230   Teenager  : 2452   night    :22828   Mean   :207.4  
##  Thursday :22388                                        3rd Qu.:280.0  
##  Tuesday  :19484                                        Max.   :365.0  
##  Wednesday:32091                                                       
##  avg_wind_speed      TMIN             TMAX            PRCP       
##  Min.   :0.00   Min.   :-14.30   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:1.30   1st Qu.:  7.80   1st Qu.:16.10   1st Qu.: 0.000  
##  Median :1.60   Median : 13.90   Median :21.10   Median : 0.000  
##  Mean   :1.82   Mean   : 12.96   Mean   :21.09   Mean   : 2.914  
##  3rd Qu.:2.20   3rd Qu.: 19.40   3rd Qu.:28.30   3rd Qu.: 1.000  
##  Max.   :5.20   Max.   : 27.80   Max.   :35.00   Max.   :46.200  
##                                                                  
##       SNOW          avg_trip_duration   avg_speed       frequency     
##  Min.   :0.000000   Min.   :   61.0   Min.   :0.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:  401.0   1st Qu.:1.911   1st Qu.: 1.000  
##  Median :0.000000   Median :  664.5   Median :2.466   Median : 1.000  
##  Mean   :0.006377   Mean   :  864.2   Mean   :2.386   Mean   : 1.479  
##  3rd Qu.:0.000000   3rd Qu.: 1104.0   3rd Qu.:2.950   3rd Qu.: 2.000  
##  Max.   :1.000000   Max.   :17862.0   Max.   :7.206   Max.   :21.000  
## 

8 Further Data Exploration

Trip duration is heavily right-skewed: most trips are short, with a fair number of outliers at the upper end. Note that the extreme 0.5% was removed (e.g. trip durations of 1+ day).

Users are mainly adults (20-44) and middle-aged riders (45-64).

Subscribers make up the large majority of user types: 457,656 of the 551,829 training rows, or about 83%.

##      Adult    Elderly Middle Age   Teenager 
##     334701      17920     190794       8414
##   Customer Subscriber 
##      94173     457656

9 Data Preparation for Modeling

The data has been reorganized into one row per day. Information about the start day and the end day is treated separately. The quantities of interest are average birth year, average trip duration, and total check-ins/check-outs per day.
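
The per-day reorganization can be sketched as below. The report loads plyr for this step; the sketch uses only base R instead, and the frequency-weighted mean is an assumption about how the daily average was formed, since the aggregation code itself is not shown. All trip values are invented.

```r
# Toy rows in the shape of the cleaned dataset
trips = data.frame(
  start_month = c(1, 1, 1, 1),
  start_day   = c(1, 1, 2, 2),
  TMAX        = c(14.4, 14.4, 4.4, 4.4),
  PRCP        = c(1.5, 1.5, 0.0, 0.0),
  avg_trip_duration = c(1000, 400, 700, 900),
  frequency   = c(1, 2, 3, 1)
)

# Split by calendar day and collapse each group into one row
by_day  = split(trips, list(trips$start_month, trips$start_day), drop = TRUE)
per_day = do.call(rbind, lapply(by_day, function(g) {
  data.frame(
    start_month = g$start_month[1],
    start_day   = g$start_day[1],
    TMAX        = g$TMAX[1],          # weather is constant within a day
    PRCP        = g$PRCP[1],
    # frequency-weighted mean (assumed weighting, see the text above)
    avg_trip_duration = weighted.mean(g$avg_trip_duration, g$frequency),
    frequency   = sum(g$frequency)    # total check-outs that day
  )
}))
rownames(per_day) = NULL
```

The result has one row per day with the day's weather, its average trip duration, and its total number of check-outs.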

## Loading required package: plyr
##   start_month start_day avg_wind_speed TMIN TMAX PRCP SNOW avg_trip_duration
## 1           1         1              0  3.9 14.4  1.5    0          917.5800
## 2           1         2              0  1.7  4.4  0.0    0          729.0894
## 3           1         3              0  2.8  6.7  0.0    0          724.9942
## 4           1         4              0  1.7  8.3  0.0    0          739.6375
## 5           1         5              0  5.0  8.3 12.7    0          659.4482
## 6           1         7              0 -3.8  1.1  0.0    0          670.9264
##   avg_speed frequency
## 1  2.238813      1136
## 2  2.521008      1866
## 3  2.571871      2065
## 4  2.543094      2163
## 5  2.518628       846
## 6  2.626848      1860
##   start_month       start_day     avg_wind_speed       TMIN        
##  Min.   : 1.000   Min.   : 1.00   Min.   :0.000   Min.   :-16.600  
##  1st Qu.: 3.000   1st Qu.: 8.00   1st Qu.:0.900   1st Qu.:  1.100  
##  Median : 6.000   Median :15.00   Median :1.600   Median :  9.400  
##  Mean   : 6.387   Mean   :15.23   Mean   :1.676   Mean   :  8.874  
##  3rd Qu.: 9.000   3rd Qu.:22.00   3rd Qu.:2.400   3rd Qu.: 17.200  
##  Max.   :12.000   Max.   :31.00   Max.   :5.700   Max.   : 26.700  
##       TMAX            PRCP             SNOW         avg_trip_duration
##  Min.   :-9.90   Min.   : 0.000   Min.   : 0.0000   Min.   : 542.2   
##  1st Qu.: 7.80   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 706.2   
##  Median :17.20   Median : 0.000   Median : 0.0000   Median : 799.0   
##  Mean   :16.52   Mean   : 3.647   Mean   : 0.1414   Mean   : 814.8   
##  3rd Qu.:25.60   3rd Qu.: 2.500   3rd Qu.: 0.0000   3rd Qu.: 888.3   
##  Max.   :35.00   Max.   :46.500   Max.   :10.2000   Max.   :1168.4   
##    avg_speed       frequency   
##  Min.   :2.040   Min.   : 522  
##  1st Qu.:2.363   1st Qu.:1842  
##  Median :2.452   Median :2819  
##  Mean   :2.437   Mean   :2742  
##  3rd Qu.:2.545   3rd Qu.:3684  
##  Max.   :2.866   Max.   :4689
##   start_month start_day avg_wind_speed  TMIN TMAX PRCP SNOW avg_trip_duration
## 1           1         6              0  -0.5  9.4  0.0    0          733.9794
## 2           1        28              0  -3.8  3.3  0.0    0          650.7955
## 3           1        29              0  -3.8  6.1  5.8    0          658.9985
## 4           1        30              0 -14.3  1.7  0.3    1          636.5608
## 5           2         4              0   5.0 16.1  0.0    0          754.8000
## 6           2        18              0  -3.2  5.6  2.3    0          702.6888
##   avg_speed frequency
## 1  2.459605      1607
## 2  2.650070      1684
## 3  2.647706      1594
## 4  2.508030      1212
## 5  2.544449      2271
## 6  2.535576      1272
##   start_month       start_day     avg_wind_speed       TMIN       
##  Min.   : 1.000   Min.   : 2.00   Min.   :0.000   Min.   :-14.30  
##  1st Qu.: 5.000   1st Qu.: 9.00   1st Qu.:1.200   1st Qu.:  5.60  
##  Median : 7.000   Median :18.00   Median :1.600   Median : 12.80  
##  Mean   : 7.082   Mean   :17.68   Mean   :1.737   Mean   : 11.16  
##  3rd Qu.:10.000   3rd Qu.:26.00   3rd Qu.:2.200   3rd Qu.: 18.30  
##  Max.   :12.000   Max.   :31.00   Max.   :5.200   Max.   : 27.80  
##       TMAX            PRCP             SNOW        avg_trip_duration
##  Min.   : 0.00   Min.   : 0.000   Min.   :0.0000   Min.   : 624.2   
##  1st Qu.:12.80   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 754.8   
##  Median :20.00   Median : 0.000   Median :0.0000   Median : 831.1   
##  Mean   :19.09   Mean   : 3.877   Mean   :0.0137   Mean   : 839.8   
##  3rd Qu.:26.70   3rd Qu.: 1.800   3rd Qu.:0.0000   3rd Qu.: 913.2   
##  Max.   :35.00   Max.   :46.200   Max.   :1.0000   Max.   :1142.9   
##    avg_speed       frequency   
##  Min.   :2.057   Min.   : 713  
##  1st Qu.:2.341   1st Qu.:2271  
##  Median :2.417   Median :3203  
##  Mean   :2.408   Mean   :3005  
##  3rd Qu.:2.505   3rd Qu.:3903  
##  Max.   :2.712   Max.   :4889

10 Modeling

  1. Topics to be used
    1. Linear Regression
    2. KNN
    3. ANN
    4. SVM
    5. Decision Trees
    6. Random Forest
  2. Modeling
    1. avg trip duration per day
    2. avg speed per day
    3. total check outs per day
    4. avg travel distance per day

10.1 Avg trip duration per day

10.1.1 Linear model

## 
## Call:
## lm(formula = avg_trip_duration ~ start_month + start_day + avg_wind_speed + 
##     TMIN + TMAX + PRCP + SNOW, data = train_per_day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -178.09  -48.73  -13.36   33.08  251.65 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    648.6231    18.4841  35.091  < 2e-16 ***
## start_month     -0.4084     1.6984  -0.240    0.810    
## start_day        0.2206     0.5404   0.408    0.683    
## avg_wind_speed   0.1718     4.7859   0.036    0.971    
## TMIN            -2.0381     1.8619  -1.095    0.275    
## TMAX            11.8893     1.6723   7.109 9.46e-12 ***
## PRCP            -3.9023     0.6183  -6.311 1.06e-09 ***
## SNOW             7.8154     5.4484   1.434    0.153    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.51 on 284 degrees of freedom
## Multiple R-squared:  0.6541, Adjusted R-squared:  0.6456 
## F-statistic: 76.73 on 7 and 284 DF,  p-value: < 2.2e-16

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  0  1  1  1  1  1  1  1  0  0  1  1  0  1  1  0  0  1  0  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  0  1  0  1  1  1  1  1  1  1  0  1  1  1  1  0  1  1  0  1  1  0  0 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 
##  0  1  1  0  1  0  1  0  1  0  1  0  1  1  1  1  1  1  1  1  1

10.1.7 Assess Models

# For one model's 0/1 accuracy vector: bad count, good count, share of good predictions
summarize_accuracy = function(acc) c(length(acc) - sum(acc), sum(acc), round(sum(acc) / length(acc), 3))

# One row per model
trip_duration_model_comparison = rbind(
  linear        = summarize_accuracy(trip_duration_accuracy_linear),
  knn           = summarize_accuracy(trip_duration_accuracy_knn),
  ann           = summarize_accuracy(trip_duration_accuracy_ann),
  svm           = summarize_accuracy(trip_duration_accuracy_svm),
  decision_tree = summarize_accuracy(trip_duration_accuracy_dt),
  random_forest = summarize_accuracy(trip_duration_accuracy_rf)
)
colnames(trip_duration_model_comparison) = c("bad_prediction","good_prediction", "accuracy")

10.1.8 Findings

Of the models assessed, SVM gave the best prediction of average trip duration, with an accuracy of ~79.5%. A prediction is counted as accurate if it lies within 10% of the actual trip duration value.
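
The 10% accuracy rule can be written as a small helper that turns predictions into the kind of 0/1 vector tabulated in this report. The function name, the inclusive boundary, and the numbers are assumptions for illustration; the report does not show how its accuracy vectors were built.

```r
# 1 if the prediction is within `tolerance` (relative) of the actual value, else 0
within_tolerance = function(predicted, actual, tolerance = 0.10) {
  as.integer(abs(predicted - actual) <= tolerance * abs(actual))
}

# Made-up predictions against made-up actual daily averages (seconds)
actual    = c(800, 700, 900, 650)
predicted = c(850, 790, 905, 600)

hits = within_tolerance(predicted, actual)
hits                      # 1 0 1 1
sum(hits) / length(hits)  # 0.75
```

The second prediction misses because it is off by 90 seconds while the allowed band around 700 is only 70 seconds.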

The linear model shows a relationship between average trip duration and several weather parameters. TMAX has a positive relationship and PRCP a negative relationship with average trip duration, so in general warmer days have longer rides and rainy days have shorter rides.
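
As a worked reading of the two significant coefficients in the trip-duration model summary, the snippet below computes the change in predicted average trip duration for a 10-degree and a 10 mm change. It assumes the units listed in the dataset explanation (seconds, degrees C, mm) and holds all other predictors fixed; the 10-unit step is an arbitrary choice for illustration.

```r
# Coefficients copied from the trip-duration linear model summary
coef_tmax = 11.8893   # seconds per degree C of daily maximum temperature
coef_prcp = -3.9023   # seconds per mm of precipitation

# Change in predicted average trip duration, other predictors held fixed
effect_warmer = coef_tmax * 10   # a day with TMAX 10 degrees C higher
effect_rain   = coef_prcp * 10   # a day with 10 mm more precipitation

effect_warmer  # 118.893
effect_rain    # -39.023
```

Under these assumptions, a day that is 10 degrees warmer adds roughly two minutes to the predicted average ride, and 10 mm of rain removes roughly 40 seconds.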

TMIN, SNOW, avg_wind_speed, start_month, and start_day had no statistically significant effect (all p > 0.05). This indicates that average trip duration is not measurably affected by minimum temperature, snow amount, wind speed, or the month or day of the trip.

##               bad_prediction good_prediction accuracy
## linear                    20              53    0.726
## knn                       17              56    0.767
## ann                       36              37    0.507
## svm                       15              58    0.795
## decision_tree             27              46    0.630
## random_forest             27              46    0.630

10.2 Avg speed per day

10.2.1 Linear Model

## 
## Call:
## lm(formula = avg_speed ~ start_month + start_day + avg_wind_speed + 
##     TMIN + TMAX + PRCP + SNOW, data = train_per_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27389 -0.07950  0.02405  0.07763  0.22919 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.6593008  0.0261865 101.552  < 2e-16 ***
## start_month    -0.0109588  0.0024061  -4.555 7.80e-06 ***
## start_day      -0.0010170  0.0007656  -1.328 0.185169    
## avg_wind_speed  0.0154658  0.0067803   2.281 0.023290 *  
## TMIN            0.0032415  0.0026377   1.229 0.220122    
## TMAX           -0.0121248  0.0023692  -5.118 5.71e-07 ***
## PRCP            0.0029408  0.0008760   3.357 0.000895 ***
## SNOW           -0.0134834  0.0077187  -1.747 0.081745 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1112 on 284 degrees of freedom
## Multiple R-squared:  0.4905, Adjusted R-squared:  0.4779 
## F-statistic: 39.06 on 7 and 284 DF,  p-value: < 2.2e-16

10.2.2 Linear Model - Interaction

The interaction terms are built from the factors that were significant in the simple linear regression (start_month, avg_wind_speed, TMAX, and PRCP).

## 
## Call:
## lm(formula = avg_speed ~ start_month + start_day + avg_wind_speed + 
##     TMIN + TMAX + PRCP + SNOW + start_month * avg_wind_speed + 
##     start_month * TMAX + start_month * PRCP + avg_wind_speed * 
##     TMAX + avg_wind_speed * PRCP + TMAX * PRCP, data = train_per_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26712 -0.06803  0.02498  0.07633  0.20333 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.644e+00  2.921e-02  90.526  < 2e-16 ***
## start_month                -1.480e-02  5.820e-03  -2.542 0.011557 *  
## start_day                  -1.013e-03  7.553e-04  -1.342 0.180816    
## avg_wind_speed              7.479e-02  1.783e-02   4.196 3.66e-05 ***
## TMIN                        2.017e-03  2.727e-03   0.740 0.460051    
## TMAX                       -1.145e-02  2.812e-03  -4.070 6.13e-05 ***
## PRCP                        2.404e-03  2.816e-03   0.854 0.393976    
## SNOW                       -1.076e-02  8.326e-03  -1.293 0.197227    
## start_month:avg_wind_speed -3.065e-03  1.787e-03  -1.715 0.087486 .  
## start_month:TMAX            5.913e-04  3.251e-04   1.819 0.069973 .  
## start_month:PRCP           -3.288e-05  3.049e-04  -0.108 0.914195    
## avg_wind_speed:TMAX        -2.935e-03  8.791e-04  -3.338 0.000959 ***
## avg_wind_speed:PRCP        -3.302e-04  8.275e-04  -0.399 0.690195    
## TMAX:PRCP                   7.507e-05  1.016e-04   0.739 0.460495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1094 on 278 degrees of freedom
## Multiple R-squared:  0.5172, Adjusted R-squared:  0.4946 
## F-statistic: 22.91 on 13 and 278 DF,  p-value: < 2.2e-16
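
In an R formula, `a * b` expands to `a + b + a:b`, so the long formula in the call above contains seven main effects plus six pairwise interactions. The sketch below checks, without fitting anything, that a more compact spelling produces the same set of terms. The compact formula is offered as an equivalent alternative, not as what the report used.

```r
full = avg_speed ~ start_month + start_day + avg_wind_speed +
  TMIN + TMAX + PRCP + SNOW + start_month * avg_wind_speed +
  start_month * TMAX + start_month * PRCP + avg_wind_speed * TMAX +
  avg_wind_speed * PRCP + TMAX * PRCP

# (a + b + c + d)^2 means: all main effects plus all pairwise interactions
compact = avg_speed ~ start_day + TMIN + SNOW +
  (start_month + avg_wind_speed + TMAX + PRCP)^2

# Term labels with the variables inside each interaction sorted,
# so that "a:b" and "b:a" compare equal
term_set = function(f) {
  labels = attr(terms(f), "term.labels")
  parts  = strsplit(labels, ":", fixed = TRUE)
  sort(vapply(parts, function(p) paste(sort(p), collapse = ":"), character(1)))
}

identical(term_set(full), term_set(compact))  # TRUE
length(term_set(full))                        # 13
```

The 13 terms match the 13 model degrees of freedom reported in the F-statistic line of the summary.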

10.2.9 Findings

KNN is the best at predicting average speed, with an accuracy of ~94.5%.

The linear model shows a relationship between average bike speed and weather parameters. TMAX has a negative relationship with average bike speed, while wind speed and PRCP have positive relationships. From this, it can be concluded that worse weather is associated with faster bike speed. Also, among the significant factors from the simple linear regression model, there is a negative interaction between average wind speed and TMAX: the positive effect of wind speed on bike speed weakens as the maximum temperature rises.
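
One way to read the negative avg_wind_speed:TMAX interaction is through the marginal effect of wind speed, which in the interaction model depends on the other variables it is crossed with. The sketch below plugs in the coefficients from the interaction model summary; the function name and the two example days are chosen here for illustration, and only the wind-by-TMAX term is statistically significant, so the other terms add noise to this reading.

```r
# Coefficients copied from the interaction model summary
b_wind       =  7.479e-02
b_wind_month = -3.065e-03
b_wind_tmax  = -2.935e-03
b_wind_prcp  = -3.302e-04

# Change in predicted avg_speed per unit increase in avg_wind_speed,
# evaluated at a given month, TMAX (degrees C) and PRCP (mm)
wind_effect = function(start_month, tmax, prcp) {
  b_wind + b_wind_month * start_month + b_wind_tmax * tmax + b_wind_prcp * prcp
}

cold_day = wind_effect(start_month = 1, tmax = 0,  prcp = 0)   # cold January day
hot_day  = wind_effect(start_month = 7, tmax = 30, prcp = 0)   # hot July day
```

With these inputs the wind effect is about 0.072 on the cold day and about -0.035 on the hot day, so the estimated effect of wind on speed shrinks and eventually changes sign as temperature rises.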

Additionally, start month is negatively related to average bike speed: with weather held fixed, average speed falls slightly as the calendar year progresses. This sits awkwardly with the weather-related finding above, under which the later, colder months would be expected to show faster speeds; a single linear month term cannot represent a seasonal cycle.

##               bad_prediction good_prediction accuracy
## linear                     5              68    0.932
## knn                        4              69    0.945
## ann                       10              63    0.863
## svm                        7              66    0.904
## decision_tree             13              60    0.822
## random_forest              8              65    0.890

10.3 Total check outs per day

10.3.1 Linear Model

## 
## Call:
## lm(formula = frequency ~ start_month + start_day + avg_wind_speed + 
##     TMIN + TMAX + PRCP + SNOW, data = train_per_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2040.10  -277.71    66.89   315.36  1231.84 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1739.344    123.646  14.067  < 2e-16 ***
## start_month       6.964     11.361   0.613  0.54040    
## start_day        -9.530      3.615  -2.636  0.00885 ** 
## avg_wind_speed    3.152     32.015   0.098  0.92164    
## TMIN             20.811     12.455   1.671  0.09583 .  
## TMAX             66.148     11.187   5.913 9.62e-09 ***
## PRCP            -47.821      4.136 -11.561  < 2e-16 ***
## SNOW            -37.774     36.446  -1.036  0.30087    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 525.2 on 284 degrees of freedom
## Multiple R-squared:  0.7712, Adjusted R-squared:  0.7656 
## F-statistic: 136.8 on 7 and 284 DF,  p-value: < 2.2e-16

10.3.8 Findings

From the results shown below, KNN has the best accuracy for predicting check-outs, at 0.973.

In addition, the linear model shows a relationship between the number of check-outs and weather parameters: TMAX has a positive relationship with the number of check-outs, while PRCP has a negative relationship.

##               bad_prediction good_prediction accuracy
## linear                    45              28    0.384
## knn                        2              71    0.973
## ann                       60              13    0.178
## svm                       42              31    0.425
## decision_tree             50              23    0.315
## random_forest             40              33    0.452

10.4 Avg travel distance per day

10.4.1 Linear model

## 
## Call:
## lm(formula = avg_distance ~ start_month + start_day + avg_wind_speed + 
##     TMIN + TMAX + PRCP + SNOW, data = train_per_day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.47  -71.50   -4.91   52.84  388.18 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1745.5345    25.6925  67.939  < 2e-16 ***
## start_month      -8.7316     2.3607  -3.699  0.00026 ***
## start_day        -0.2766     0.7512  -0.368  0.71296    
## avg_wind_speed   11.5331     6.6524   1.734  0.08406 .  
## TMIN             -2.1249     2.5880  -0.821  0.41229    
## TMAX             18.4861     2.3245   7.953 4.35e-14 ***
## PRCP             -6.8540     0.8595  -7.975 3.77e-14 ***
## SNOW              8.0064     7.5731   1.057  0.29131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.1 on 284 degrees of freedom
## Multiple R-squared:  0.7217, Adjusted R-squared:  0.7149 
## F-statistic: 105.2 on 7 and 284 DF,  p-value: < 2.2e-16

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  0  1  1  1  1  1  1  1  1  0  1  1  1  1  1  0  1  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  0  1 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 
##  1  1  1  1  1  1  1  0  1  1  1  0  1  1  1  1  1  1  1  1  1

10.4.8 Findings

As with average speed and check-outs, KNN was the best model, with a prediction accuracy of ~97% at an error threshold of 10%.

The linear model shows notable linkages between average distance travelled in a day and start month, maximum temperature, and precipitation.

Ride distances increased markedly with maximum daily temperature, dropped with increasing precipitation, and decreased as the calendar year progressed. The last result is curious: one would expect distance to increase from January (winter weather) to July (summer weather) and then decrease through December (return to winter weather).

Because of the inherent seasonality and human factors involved with the data, a linear regression paints a rough picture but other models such as KNN provide more accurate predictions.
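
For readers unfamiliar with the method, the idea behind KNN regression can be sketched in a few lines of base R. This is not the implementation used in the report, whose KNN code and package are not shown; the function name, the choice of features, the value of k, and all data values are assumptions made for illustration.

```r
# k-nearest-neighbour regression: predict each query point as the mean
# response of its k closest training points (Euclidean distance on scaled features)
knn_regress = function(train_x, train_y, test_x, k = 3) {
  train_x = as.matrix(train_x)
  test_x  = as.matrix(test_x)
  # Scale both sets with the training means and standard deviations
  mu      = colMeans(train_x)
  sigma   = apply(train_x, 2, sd)
  train_s = sweep(sweep(train_x, 2, mu, "-"), 2, sigma, "/")
  test_s  = sweep(sweep(test_x,  2, mu, "-"), 2, sigma, "/")
  apply(test_s, 1, function(q) {
    d = sqrt(rowSums(sweep(train_s, 2, q, "-")^2))
    mean(train_y[order(d)[1:k]])
  })
}

# Made-up daily weather and average distances, for illustration only
train_x = data.frame(TMAX = c(0, 2, 4, 26, 28, 30), PRCP = c(0, 1, 0, 0, 1, 0))
train_y = c(1500, 1520, 1540, 2100, 2120, 2140)
test_x  = data.frame(TMAX = c(1, 29), PRCP = c(0, 0))

pred = knn_regress(train_x, train_y, test_x, k = 2)
```

Because each prediction is a local average of similar days, KNN can follow a seasonal rise and fall that a single straight-line month term cannot.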

##               bad_prediction good_prediction accuracy
## linear                     7              66    0.904
## knn                        2              71    0.973
## ann                       20              53    0.726
## svm                        7              66    0.904
## decision_tree             16              57    0.781
## random_forest              8              65    0.890

11 Conclusion

To be filled.